llama + spec: MTP Support #22673

Open
am17an wants to merge 19 commits into ggml-org:master from am17an:mtp-clean

Conversation

am17an (Contributor) commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline. The design decisions I took to get to this stage are as follows:

Next Steps

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}
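
(Rough arithmetic on the run above: each verification pass emits one token sampled by the target plus the accepted drafts, so 952 accepted out of 1406 predicted tokens works out to roughly 1406 / (1406 - 952) ≈ 3.1 tokens per target pass; against the 201.07 s baseline this gives a wall-clock speedup of about 201.07 / 83.8 ≈ 2.4x, the gap to ~3.1x coming from the cost of running the MTP head and of rejected drafts.)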

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with --spec-draft-n-max 16 and partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing, as well as for the convert_hf_to_gguf.py changes and model definitions, and for writing the bench used for validation against vLLM.

github-actions bot added the model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), examples, python (python script changes), server, and ggml (changes relating to the ggml tensor library for machine learning) labels on May 4, 2026
ngxson (Contributor) commented May 4, 2026

Nice, I think this is a better fresh start than my WIP #18886 (which I still haven't found the time to continue).

There were some other attempts to add MTP support, but they all heavily rely on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

ngxson (Contributor) left a comment:

(not a review, but opening some discussions)

Comment thread: src/llama-memory-recurrent.h
Comment thread: src/models/qwen35.cpp (Outdated)

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
ngxson:

Nit, but maybe call it n_main_layers, as technically the NextN layer is also a transformer layer.

Comment thread: tools/server/server-context.cpp (Outdated)
Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
ngxson:

If you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type.

am17an (Contributor, Author):

Yes, that seems like the correct way to do this if we want to support MTP in a generic way.
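
To make that direction concrete, a rough sketch of what the caller side could look like if the graph/context type were exposed publicly (the commit list below shows the PR later moved to llama_context_type; the field ctx_type and the value LLAMA_CONTEXT_TYPE_MTP here are placeholders, not actual llama.cpp API):

// Sketch only: share one llama_model between the main context and an MTP context,
// instead of re-loading the model file with override_arch as in the snippet above.
// `ctx_type` / LLAMA_CONTEXT_TYPE_MTP are hypothetical placeholders.
static llama_context * make_mtp_context(llama_model * model, uint32_t n_ctx) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx    = n_ctx;
    cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP; // hypothetical: selects the MTP/NextN graph

    // both contexts reuse the weights of `model`, so only the MTP head's KV cache adds memory
    return llama_init_from_model(model, cparams);
}

The server would then keep a single llama_model and create two contexts over it, avoiding the second llama_model_load_from_file call.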

am17an (Contributor, Author) commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts.

pwilkin (Member) commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

cmp-nct (Contributor) commented May 4, 2026

In my opinion, Qwen 3.6 is the most important thing to happen in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting? So MTP is used until ngram is triggered, switching to ngram until rejection and then back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing), and MTP fills the gap.

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even with very small draft models, for that exact reason: they need their own context and KV cache. Such low- to mid-range systems already operate on the edge in terms of memory.

mbednarek360:

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

am17an (Contributor, Author) commented May 4, 2026

@cmp-nct I'm not sure, but could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of overall memory used (it's just a single-layer transformer + KV cache, much lighter than draft models)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan for now; if you ask an LLM to add support for that you might get a little further. (Vulkan and Metal have also been tested now.)

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., possibly preventing some degree of resource contention.

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this becoming stable. Here are automated test results from my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

mode                      model                                                               prefill tok/s avg   generation tok/s avg   MTP acceptance   loaded VRAM
MTP enabled               Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3    665.14              42.45                  76.0%            24.96 GiB
MTP disabled, same GGUF   Qwen3.6-27B-MTP-Q6_K.gguf, no spec                                  1315.46             22.97                  n/a              22.47 GiB
Existing non-MTP Q6       Qwen3.6-27B-Q6_K.gguf, no spec                                      1260.12             22.39                  n/a              22.59 GiB

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

am17an (Contributor, Author) commented May 4, 2026

@cturan Thanks for testing, I'm aware of the prefill issue and will work on a fix.

iiLaurens:

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)
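
Not something this PR provides, but as a rough sketch of where such a merge could start, here is how one might inspect the donor GGUF with ggml's gguf C API (the file name reuses the one from the commands above, the qwen3moe.* key prefix is a guess, and an actual merge would still need to copy tensor data and rewrite the metadata, including kv_count):

#include <cstdio>
#include <cstring>
#include "gguf.h"

int main() {
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file("qwen3.6-q8_0-mtp.gguf", params); // example path
    if (ctx == NULL) {
        return 1;
    }

    // list the tensors that belong to the MTP/NextN block (blk.40 in the quant discussed above)
    for (int64_t i = 0; i < gguf_get_n_tensors(ctx); ++i) {
        const char * name = gguf_get_tensor_name(ctx, i);
        if (strncmp(name, "blk.40.", 7) == 0) {
            printf("MTP tensor: %s\n", name);
        }
    }

    // block_count is the top-level key that would need bumping; the arch prefix is a guess
    const int64_t kid = gguf_find_key(ctx, "qwen3moe.block_count");
    if (kid >= 0) {
        printf("block_count = %u\n", gguf_get_val_u32(ctx, kid));
    }

    gguf_free(ctx);
    return 0;
}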

volkermauel:

Only a quick test run: 1x 5090, qwen3.6-27b, MTP 3, q4_0 quantized, KV cache also q4_0.

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great Speedup!

alexandrupetraru:

This is a game changer: on Strix Halo with the q8 Qwen 3.6 35BA3B, TG jumps from 40 to 70 at low context, and for the 27B from 12 to 25 TG (with a 50/50 layer split across a 7900 XTX and Strix Halo) for coding. We need this one merged to master ASAP together with turbo4; it performs very well and without any issues. Good job!

GloballyUniquePlaceholder:

On a 3060 Laptop (6GB VRAM) + 64GB RAM, running your provided Qwen 3.6 35BA3B GGUF, there is a reasonable speed up.

spec-draft-n-max   average tok/s   wall_s_total   aggregate_accept_rate
n/a (no MTP)       22.92           77.69          n/a
1                  27.58           68.34          0.8835
2                  29.39           66.00          0.815
3                  27.78           67.96          0.7127
4                  26.09           72.23          0.6421

Raw results:

spec-draft-n-max 4

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 4

python mtp-bench.py
  code_python        pred= 192 draft= 180 acc= 146 rate=0.811 tok/s=31.3
  code_cpp           pred= 192 draft= 216 acc= 136 rate=0.630 tok/s=22.7
  explain_concept    pred= 192 draft= 224 acc= 134 rate=0.598 tok/s=22.3
  summarize          pred=  53 draft=  52 acc=  39 rate=0.750 tok/s=33.3
  qa_factual         pred= 192 draft= 196 acc= 141 rate=0.719 tok/s=29.2
  translation        pred=  22 draft=  32 acc=  13 rate=0.406 tok/s=19.4
  creative_short     pred= 192 draft= 264 acc= 124 rate=0.470 tok/s=20.7
  stepwise_math      pred= 192 draft= 192 acc= 143 rate=0.745 tok/s=30.7
  long_code_review   pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=25.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1576,
  "total_draft_accepted": 1012,
  "aggregate_accept_rate": 0.6421,
  "wall_s_total": 72.23
}

spec-draft-n-max 3

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

python mtp-bench.py
  code_python        pred= 192 draft= 165 acc= 136 rate=0.824 tok/s=30.2
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=27.6
  explain_concept    pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=25.3
  summarize          pred=  53 draft=  48 acc=  36 rate=0.750 tok/s=32.5
  qa_factual         pred= 192 draft= 180 acc= 131 rate=0.728 tok/s=29.2
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=24.5
  creative_short     pred= 192 draft= 210 acc= 120 rate=0.571 tok/s=23.2
  stepwise_math      pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=30.5
  long_code_review   pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=27.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1347,
  "total_draft_accepted": 960,
  "aggregate_accept_rate": 0.7127,
  "wall_s_total": 67.96
}

spec-draft-n-max 2

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

python mtp-bench.py
  code_python        pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=31.5
  code_cpp           pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=27.0
  explain_concept    pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=25.6
  summarize          pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=32.2
  qa_factual         pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.1
  translation        pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=30.8
  creative_short     pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=25.9
  stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=31.3
  long_code_review   pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=29.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 1070,
  "total_draft_accepted": 872,
  "aggregate_accept_rate": 0.815,
  "wall_s_total": 66.0
}

spec-draft-n-max 1

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 1

python mtp-bench.py
  code_python        pred= 192 draft=  96 acc=  94 rate=0.979 tok/s=28.3
  code_cpp           pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=26.2
  explain_concept    pred= 192 draft= 102 acc=  89 rate=0.873 tok/s=25.9
  summarize          pred=  56 draft=  29 acc=  26 rate=0.897 tok/s=30.6
  qa_factual         pred= 192 draft= 100 acc=  90 rate=0.900 tok/s=28.5
  translation        pred=  22 draft=  12 acc=   9 rate=0.750 tok/s=27.0
  creative_short     pred= 192 draft= 104 acc=  86 rate=0.827 tok/s=24.9
  stepwise_math      pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.7
  long_code_review   pred= 192 draft= 102 acc=  88 rate=0.863 tok/s=28.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1422,
  "total_draft": 747,
  "total_draft_accepted": 660,
  "aggregate_accept_rate": 0.8835,
  "wall_s_total": 68.34
}

no mtp

llama.cpp\build\bin\Release\llama-server.exe -fa on -c 5000 -np 1 -fit on -m Qwen3.6-35BA3B-MTP.gguf --chat-template-kwargs "{\"preserve_thinking\": true}"

python mtp-bench.py
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.2
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=25.9
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=22.1
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=22.3
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=21.4
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.0
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=24.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1419,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 77.69
}

ninjas28 commented May 5, 2026

Crashes when using -sm tensor. llama-server launch command args: -hf am17an/Qwen3.6-27B-MTP-GGUF:Q8_0 -sm tensor -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3. Using -sm tensor without MTP works fine. This is on a triple-GPU setup using ROCm.

srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist 
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 356
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 352, batch.n_tokens = 352, progress = 0.988764
/root/llama.cpp/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
/root/llama.cpp/build/bin/libggml-base.so.0(+0x1b25b)[0x74b4b4ca925b]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)[0x74b4b4ca96df]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x152)[0x74b4b4ca98b2]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41506)[0x74b4b4ccf506]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x3d579)[0x74b4b4ccb579]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41adb)[0x74b4b4ccfadb]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x474)[0x74b4b4cbff54]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111)[0x74b4b4cc6351]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe8)[0x74b4b44dac08]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context21handle_mtp_for_ubatchEiPKiS1_P11ggml_tensor+0x20d)[0x74b4b44da9bd]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x142)[0x74b4b44dac62]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
llama-server(+0xf846e)[0x63c5e42c046e]
llama-server(+0x172971)[0x63c5e433a971]
llama-server(+0x5842c)[0x63c5e422042c]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74b4b3c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74b4b3c29e40]
llama-server(+0x58cd5)[0x63c5e4220cd5]
Aborted

superjamie:

Tested on 3x RTX 3060 12GB. Sorry, I don't have the VRAM for your Q8; I used RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF, which was quantized with ik_llama's MTP.

Prompt: "Write a simple minimal hash table implementation in C99."

Three runs with no MTP, avg generation 18.51 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}'

prompt eval time =     177.62 ms /    24 tokens (    7.40 ms per token,   135.12 tokens per second)
       eval time =   99331.08 ms /  1837 tokens (   54.07 ms per token,    18.49 tokens per second)
      total time =   99508.70 ms /  1861 tokens

prompt eval time =     159.10 ms /    24 tokens (    6.63 ms per token,   150.85 tokens per second)
       eval time =  107505.42 ms /  1988 tokens (   54.08 ms per token,    18.49 tokens per second)
      total time =  107664.52 ms /  2012 tokens

prompt eval time =     158.43 ms /    24 tokens (    6.60 ms per token,   151.49 tokens per second)
       eval time =   48263.07 ms /   895 tokens (   53.93 ms per token,    18.54 tokens per second)
      total time =   48421.51 ms /   919 tokens

Three runs with MTP, avg generation 32.24 tok/sec:

llama-server --model /models/RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF/Qwen3.6-27B-MTP-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0 --n-gpu-layers 999 --flash-attn on --ctx-size $((16*1024)) \
 --temp 0.6 --top-p 0.95 --presence-penalty 0.0 --top-k 20 --min-p 0.0 --repeat_penalty 1.0 \
 --no-mmproj --chat-template-kwargs '{"enable_thinking":false}' \
 --spec-type mtp --spec-draft-n-max 3 --parallel 1

prompt eval time =     232.24 ms /    24 tokens (    9.68 ms per token,   103.34 tokens per second)
       eval time =   34610.94 ms /  1110 tokens (   31.18 ms per token,    32.07 tokens per second)
      total time =   34843.18 ms /  1134 tokens 
      
prompt eval time =     207.99 ms /    24 tokens (    8.67 ms per token,   115.39 tokens per second)
       eval time =   32110.05 ms /  1064 tokens (   30.18 ms per token,    33.14 tokens per second)
      total time =   32318.03 ms /  1088 tokens
      
prompt eval time =     208.50 ms /    24 tokens (    8.69 ms per token,   115.11 tokens per second)
       eval time =   39029.34 ms /  1230 tokens (   31.73 ms per token,    31.51 tokens per second)
      total time =   39237.84 ms /  1254 tokens 

Result: a 74% speedup. Wow!

Thank you for your work. You will make many users happy with this. What an exciting PR!

One small hiccup. On my initial attempt I got the error message:

load_model: MTP currently supports only n_parallel=1; got 4

Adding --parallel 1 fixed that.

i386 commented May 14, 2026

I've gone ahead and implemented Metal backend support for this: am17an#10

ggerganov (Member):
@pepedombo Could you try bumping the batch size to -b 8192 -ub 512 and see if it helps with the PP?

pepedombo:

> @pepedombo Could you try bumping the batch size to -b 8192 -ub 512 and see if it helps with the PP?

Already tried various batch sizes and it simply stays at a constant speed.

prompt eval time =   27609.50 ms / 18189 tokens (    1.52 ms per token,   658.79 tokens per second)
       eval time =    1852.60 ms /    63 tokens (   29.41 ms per token,    34.01 tokens per second)
      total time =   29462.10 ms / 18252 tokens

Evals are from Qwen Code. Without MTP I get ~1300-1400 PP and 22-26 TG.

ggerganov (Member):
Ok, thanks for the info. We'll focus on the PP improvements after the merge.

DenysAshikhin:

With Vulkan and a25be1b, there seems to be an error when trying to combine MTP and ngram. Not sure if this is supposed to work here yet, but just leaving it here in case it is:

init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 2065
 - the tokens for sequence 0 in the input batch have a starting position of Y = 2010
 for M-RoPE, it is required that the position satisfies: X < Y
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 0, error: Invalid input batch.

Command:

llama-server -hf am17an/Qwen3.6-35BA3B-MTP-GGUF --host 0.0.0.0 --port 8080 --no-mmap --fit off --spec-type draft-mtp,ngram-mod --spec-draft-n-max 3 --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 64 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --parallel 1

I apologise if this isn't the place to ask, but how exactly does draft + ngram work?
--spec-draft-n-max 3 -> means the MTP heads will generate 3 tokens.
--ngram params -> in this case, look at the last 24 tokens, generate at least 48 (if it can), up to a max of 64.

Then, in practice, does it mean it will first try ngram acceptance, and only if that fails (meaning we didn't get 48-64 tokens accepted) does it try the MTP approach?

Neppord commented May 14, 2026

When I try this branch I get an error during compile (a missing ";"), and after fixing that I get this error when trying to run:

./build/bin/llama-server --host 0.0.0.0 --port 44444 -hf unsloth/Qwen3.6-27B-MTP-GGUF --no-mmproj --alias qwen --reasoning on -c 8192 -ngl 99 -fa on --jinja -np 1 -b 8192 -ub 1024 --cache-type-k q4_0 --cache-type-v q4_0

[...lots of output...]

.../llama.cpp/src/llama-memory-recurrent.cpp:173: GGML_ASSERT(rollback >= 1 && rollback <= (llama_pos) n_rs_seq) failed

[...more output...]

I'm running an Intel Arc Pro B70; I have also tried merging in master to see if that helped, with no real change. I'm using the SYCL backend.

am17an and others added 18 commits May 14, 2026 22:19
* MTP: clean-up

* review: use llama_context_type instead of llama_graph_type

* review: remove llama_model_has_mtp

* review: fix convert issues

* convert: fix pycheck

* review: formatting

* use `mtp-` for identifying mtp models

* convert: fix mtp conversion
Currently speculative decoding needs to restart from a checkpoint
after some draft tokens are not accepted, and this leads to some wastage in
running the target again. This PR adds the ability to roll back up to
`draft_max` by storing the GDN intermediates.
Extend the gated delta net kernel to store intermediate states for
partial rollback support on the Metal backend.

- Add K (snapshot slot count) as a function constant
- Read input state from slot 0 of the 3D state tensor
- Write intermediate states to different slots during token loop
- For K=1, maintain backward-compatible single-slot behavior

Ref: ggml-org@8c05923

Assisted-by: llama.cpp:local pi
Neppord commented May 14, 2026

Wow, I'm so impressed by your work!

* server : adjust checkpoint logic

* cont : rm asserts
unbug commented May 14, 2026

Please keep SM70 (V100) supported.
